Preceding work for the conservation analysis is omitted aside from histograms plotting window’s scores and percent scores. Here I attempt to define some characteristics of highly conserved R-loops and distinguish them from those which are not..

## GRanges object with 12 ranges and 2 metadata columns:
##                  seqnames              ranges strand |     score  pct_cons
##                     <Rle>           <IRanges>  <Rle> | <numeric> <numeric>
##    window_758542    chr11   65415000-65425000      * |       219   86.9048
##    window_758543    chr11   65417500-65427500      * |       218   86.5079
##    window_758544    chr11   65420000-65430000      * |       218   86.5079
##    window_758545    chr11   65422500-65432500      * |       217   86.1111
##    window_758573    chr11   65492500-65502500      * |       220   87.3016
##              ...      ...                 ...    ... .       ...       ...
##    window_827261    chr10 102125000-102135000      * |       217   86.1111
##    window_842706    chr12     6940000-6950000      * |       217   86.1111
##    window_889895    chr12 124912500-124922500      * |       217   86.1111
##   window_1089166    chr17   76070000-76080000      * |       217   86.1111
##   window_1166796    chr19   42067500-42077500      * |       218   86.5079
##   -------
##   seqinfo: 455 sequences from an unspecified genome; no seqlengths

Metaplot Analysis

I start by creating metaplots for three different groups of windows: the whole set, windows with a percent score greater than 60 and the 12 highest percent scoring windows.

#preparing annotations
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg38.knownGene
promoter <- getPromoters(TxDb=txdb, upstream=3000, downstream=3000)

#preparing rl_cons_60; windows with having more than 60 percent of Rloops aligned to it
limit <- 60
rl_cons_60 <- rl_cons[which(rl_cons$pct_cons >limit)]

tagMatrix_rl_cons <- getTagMatrix(rl_cons, windows = promoter)
tagMatrix_rl_cons_60 <- getTagMatrix(rl_cons_60,windows = promoter)
tagMatrix_top_ranges <- getTagMatrix(rl_cons[top_ranges], windows = promoter)

#creating metaplots for the three groups above
plotAvgProf(tagMatrix_rl_cons, xlim=c(-3000, 3000), xlab="Genomic Region (5'->3') rl_cons")
plotAvgProf(tagMatrix_rl_cons_60, xlim=c(-3000,3000), xlab = "Genomic Region (5'->3') rl_cons_60")
plotAvgProf(tagMatrix_top_ranges,xlim = c(-3000,3000), xlab = "Genomic Region (5' -> 3') top_ranges")
cat("\n")

The metaplot for all peaks produces a very noisy plot and so there is no useful information we could take from it. This should have been expected. The other two metaplots are more like those which we might see when analyzing protien coding genes (like NFE2L2 or BCRA1). However, the peaks are more normally distributed around the TSS and produce a curve which is “flatter” than what a protien coding gene might produce.

After transcription, if the nascent RNA had been modified, (ie. spliced, capped and tailed), the anealing of the RNA to the DNA template would displace regions which had previously coded the premature RNA’s exons. These displaced regions form loops which are known as R-loops[1].

R loops will form across the whole length of a gene being transcribed, and so peaks will align at windows close to a TSS, but in surrounding regions aswell since thes DNA template for these genes may span at least thousands of base pairs. This explains the “flatter” curve presented by the metaplots of rl_cons_60 and the top 12 ranges.

Knowing this, I would expect the see groups of windows with higher percent scores present increasingly flat curves - if more peaks align to the higher scoring windows, then there should be more peaks which are further upstream from the TSS. Windows with lower percent scores would produce curves which are just noise, as in the first meta plot.

After creating the returnTagMatrix function (for better readability), I repeat the process in increments of 10 (except for windows with percent scores greater than 80. There were too few windows to justify ploting 80-90 and/or 90-100 percents score windows).

## $`2`
## 
## $`3`
## 
## $`4`
## 
## $`5`
## 
## $`6`
## 
## $`7`
## 
## $`8`
## 
## $`9`

X labels describe the range of percent score of each plot.

My hypothesis was incorrect. Instead of the curve becoming more flat as I plotted higher scoring windows, each group after 40% showed a simlar curve, normally distributed around the TSS. The same remarks hold true for the higher scoring groups, that peaks would be found at and near the TSS. But as the percent score increases, there doesn’t appear to be any change in where peaks are located. I was also incorrect with lower scoring windows. Between percent scores of 0-39%, curves are inversions of what is seen with higher scoring peaks. It seems then that lesser conserved RLoops are unlikely to be found near the TSS of a gene and more conserved Rloops are more likely to be found near the TSS.

Following this, the whole set of windows were ranked in ascending order of percent score, then split into 10 groups, each with an equal number of windows. Metaplots were then generated from these new groups.

## $`1`
## 
## $`2`
## 
## $`3`
## 
## $`4`
## 
## $`5`
## 
## $`6`
## 
## $`7`
## 
## $`8`
## 
## $`9`
## 
## $`10`

The same trend shown by the first group of metaplots is true for the transformed data as well; lower percent scoring windows tend to be further from the TSS compared to higher percenting scoring windows. However, a large portion of the whole set of windows are located hundreds of basepairs from a TSS. This raises the question of whether this large portion of windows, as well as the whole set of windows, are located near r-loop forming regions(RLFRs). And given that - unlike lower percent scoring windows - higher percent scoring windows tend to be close to a TSS, could there exist a decrepency between higher and lower scoring windows in their relationship to RLFRs.

The explore this question, a permutation test over rl_cons is run against a region set describing r-loop forming regions. This was done first on the whole set of windows,

## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 21.2288
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 482220
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"

as well as each tenth.

## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 5.15
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 27314
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 5.4987
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 29546
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 8.2412
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 31253
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 10.971
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 36026
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 10.9781
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 37770
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 12.835
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 39699
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 13.8321
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 43705
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 16.9293
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 49588
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 21.9094
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 67992
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 9.99900009999e-05
## Z-score: 29.8039
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 119325
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $mfrow
## [1] 1 1
## 
## $`regioneR::numOverlaps`
## P-value: 0.304869513048695
## Z-score: 0.041
## Number of iterations: 10000
## Alternative: greater
## Evaluation of the original region set: 2
## Evaluation function: regioneR::numOverlaps
## Randomization function: regioneR::circularRandomizeRegions
## 
## attr(,"class")
## [1] "permTestResultsList"
## $`1`
## $`1`$mfrow
## [1] 1 1
## 
## 
## $`2`
## $`2`$mfrow
## [1] 1 1
## 
## 
## $`3`
## $`3`$mfrow
## [1] 1 1
## 
## 
## $`4`
## $`4`$mfrow
## [1] 1 1
## 
## 
## $`5`
## $`5`$mfrow
## [1] 1 1
## 
## 
## $`6`
## $`6`$mfrow
## [1] 1 1
## 
## 
## $`7`
## $`7`$mfrow
## [1] 1 1
## 
## 
## $`8`
## $`8`$mfrow
## [1] 1 1
## 
## 
## $`9`
## $`9`$mfrow
## [1] 1 1
## 
## 
## $`10`
## $`10`$mfrow
## [1] 1 1
## 
## 
## $`11`
## $`11`$mfrow
## [1] 1 1

In reviewing the results of the test for rl_cons it can no longer be assumed that there is no relationship between the whole set of windows and r-loop forming regions (given a significance threshold of 0.05). The same can also be said for each tenth in the ranked set.

The z score tends to increase with the percent score of each tenth.

Local z score across all tenths remains relatively flat, suggesting a regional relationship between rl_cons and the r-loop forming regions. This may be due to the windows being relatively large(10kb).

Peak annotation and upset plots

Following this, I annotated the windows (by the same 10 increment groupings) and used upset plots to display the data. From our understanding of R-loops, I would expect that higher scoring windows would have a greater amount of peaks which span all of 3’ and 5’ untranslated regions (UTR), promoter, exons and introns. Lower scoring groups would have most peaks overlapping 1 or two of these regions.

## >> preparing features information...      2021-07-12 10:24:48 AM 
## >> identifying nearest features...        2021-07-12 10:24:48 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:24:50 AM 
## >> assigning genomic annotation...        2021-07-12 10:24:50 AM 
## >> assigning chromosome lengths           2021-07-12 10:25:09 AM 
## >> done...                    2021-07-12 10:25:09 AM
## Warning: Removed 6 rows containing non-finite values (stat_count).
## >> preparing features information...      2021-07-12 10:25:44 AM 
## >> identifying nearest features...        2021-07-12 10:25:44 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:25:45 AM 
## >> assigning genomic annotation...        2021-07-12 10:25:45 AM 
## >> assigning chromosome lengths           2021-07-12 10:25:49 AM 
## >> done...                    2021-07-12 10:25:49 AM
## Warning: Removed 1 rows containing non-finite values (stat_count).
## >> preparing features information...      2021-07-12 10:26:05 AM 
## >> identifying nearest features...        2021-07-12 10:26:05 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:26:06 AM 
## >> assigning genomic annotation...        2021-07-12 10:26:06 AM 
## >> assigning chromosome lengths           2021-07-12 10:26:09 AM 
## >> done...                    2021-07-12 10:26:09 AM
## >> preparing features information...      2021-07-12 10:26:11 AM 
## >> identifying nearest features...        2021-07-12 10:26:11 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:26:11 AM 
## >> assigning genomic annotation...        2021-07-12 10:26:11 AM 
## >> assigning chromosome lengths           2021-07-12 10:26:15 AM 
## >> done...                    2021-07-12 10:26:15 AM
## >> preparing features information...      2021-07-12 10:26:16 AM 
## >> identifying nearest features...        2021-07-12 10:26:16 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:26:17 AM 
## >> assigning genomic annotation...        2021-07-12 10:26:17 AM 
## >> assigning chromosome lengths           2021-07-12 10:26:21 AM 
## >> done...                    2021-07-12 10:26:21 AM
## >> preparing features information...      2021-07-12 10:26:21 AM 
## >> identifying nearest features...        2021-07-12 10:26:21 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:26:22 AM 
## >> assigning genomic annotation...        2021-07-12 10:26:22 AM 
## >> assigning chromosome lengths           2021-07-12 10:26:25 AM 
## >> done...                    2021-07-12 10:26:25 AM
## >> preparing features information...      2021-07-12 10:26:25 AM 
## >> identifying nearest features...        2021-07-12 10:26:25 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:26:25 AM 
## >> assigning genomic annotation...        2021-07-12 10:26:25 AM 
## >> assigning chromosome lengths           2021-07-12 10:26:28 AM 
## >> done...                    2021-07-12 10:26:28 AM
## >> preparing features information...      2021-07-12 10:26:29 AM 
## >> identifying nearest features...        2021-07-12 10:26:29 AM 
## >> calculating distance from peak to TSS...   2021-07-12 10:26:29 AM 
## >> assigning genomic annotation...        2021-07-12 10:26:29 AM 
## >> assigning chromosome lengths           2021-07-12 10:26:32 AM 
## >> done...                    2021-07-12 10:26:32 AM
## $`2`
## 
## $`3`
## 
## $`4`
## 
## $`5`
## 
## $`6`
## 
## $`7`
## 
## $`8`
## 
## $`9`

My hypothesis seems to hold true per the upset plots; higher scoring windows have more peaks which span the 5 relevant regions (introns, exons, promoters, 3’ and 5’ UTRs), whereas lower scoring peaks have fewer peaks spanning the 5 regions. This divide between higher and lower scoring windows is the same as the meta plots - after a 40% score, there is appears to be a divide in the trends which each group shows.

Gene data tables

Below is the set of data tables storing the genetic information (start, end, width, name and description) for genes contained by each window. Again, windows were separated based on their percent scores.

##   1613 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
## Annotate peaks by annoPeaks, see ?annoPeaks for details.
## maxgap will be ignored.
## 
## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
## $`2`
## NULL
## 
## $`3`
## NULL
## 
## $`4`
## NULL
## 
## $`5`
## NULL
## 
## $`6`
## NULL
## 
## $`7`
## NULL
## 
## $`8`
## NULL
## 
## $`9`
## NULL

Gene data tables of the tenths using the same methods:

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html

## Warning in instance$preRenderHook(instance): It seems your data is too big
## for client-side DataTables. You may consider server-side processing: https://
## rstudio.github.io/DT/server.html
## $`1`
## NULL
## 
## $`2`
## NULL
## 
## $`3`
## NULL
## 
## $`4`
## NULL
## 
## $`5`
## NULL
## 
## $`6`
## NULL
## 
## $`7`
## NULL
## 
## $`8`
## NULL
## 
## $`9`
## NULL
## 
## $`10`
## NULL
## 
## $`11`
## NULL
library(enrichR)
setEnrichrSite("Enrichr")
websiteLive<- TRUE

if(is.null(listEnrichrDbs()))
  print("DATABASES ARE NULL")
  websiteLive<-FALSE

Enrichment analysis

Below is an enrichment analysis of some percent scoring groups. For each analysis I use the following work flow; a P.value threshold x (0.01<=x<=0.05) was used, then the set of terms were sorted in decending order [5][6].

For the 80-100% scoring group, I first began by comparing its gene set to the ChEA database.

-The highest ranking pathway is for the IKZF1 gene in male colorectal carcinoma cell lines [7]. IKZF1: expression is restricted to fetal and adult hemo-lymphopoietic system and functions as a regulator of lymphocyte differentiation [8]. All three of the genes enriched in this pathway serve some function to transfer phosphates. PRKCG code for a kinase, NPR1 is a guanylate cyclase and PTPN6 is a phosphatase. However, the kinase encoded by PRKCG is only exprssed in the brain and spinal cord [9].

Breakdown of top 5(human cell lines only):

80-100% Conservation

https://maayanlab.cloud/Enrichr/enrich?dataset=ae2bceb7fa582a59a487f4a25e3f2157 * IKZF1/HCT116 - Regulator of lymphoctye differentiation; colorectal cancer * KDM5A/BREAST - Gene regulation via histone demethylation[10]; * MYC/BL - Proto-oncogene, dimerizes and complexes with TF MAX to regulate transcription of target genes[11]; Burkitt lymphoma * DACH1/MDA-MB-231 - C ion and cell fate determination during development, highly conserved Ski domain between Drosophila and human[12]; late stage breast cancer model * LXR/THP-1 - cholestersol transport and now lipid and carbohydrate metabolism, cell diferentiation and apoptosis[13]; peripheral blood monocyte

rl80100 <- rl_withGenes[which(rl_withGenes$pct_cons >80 & rl_withGenes$pct_cons <100 )]
rl80100_symbols <- unique(rl80100$symbol)
enriched80100 <- enrichr(rl80100_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Cellular_Component_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
e80100<-enriched80100[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e80100[e80100$P.value>=0.01 & e80100$P.value<=0.05,],
          caption = "80-100% conservation",
          options = list(
          dom = 't',
          scrollX = TRUE,
          fixedColumns = list(leftColumns = 1),
          scrollCollapse = TRUE))

70-80% Conservation

https://maayanlab.cloud/Enrichr/enrich?dataset=79c154d7f327e99466fcf36bb1b59a39 * ZNF652/ZR75-1 - Zinc finger[14]; mammary gland/breast/duct epithelial * PKCTHETA/BREAST - Activation of mature T cells [15]; * DMRT1/FETAL Ovary - Vertebrate sex determination [16]; * P68/HELA - RNA helicase, overexpresssion is implicated in several cancers[16]; cancerous-cervical epithelial * CREB1/LNCaP-abl - Induces transcription of genes in reponse to hormonal stimulation of cAMP pathway[18]; Prostate carcinoma

rl7080 <- rl_withGenes[which(rl_withGenes$pct_cons >70 & rl_withGenes$pct_cons <80 )]
rl7080_symbols <- unique(rl7080$symbol)
enriched7080 <- enrichr(rl7080_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Cellular_Component_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
e7080<-enriched7080[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e7080[e7080$P.value>=0.01 & e7080$P.value<=0.05,],
          options = list(
          dom = 't',
          scrollX = TRUE,
          fixedColumns = list(leftColumns = 1),
          scrollCollapse = TRUE))

60-70% Conservation

https://maayanlab.cloud/Enrichr/enrich?dataset=5ee5a9c14b529497c6dda82f4a953f18 * EGR1/THP-1 - products of targets are required for differentiation and mitogenisis[19]; peripheral blood monocyte * ZNF652/ZR75-1 - Zinc finger[14]; mammary gland/breast/duct epithelial * KDM6A/U937 and SAOS2 - catalyzes demethylation of histone H3[20]; Pleura/pleural effusion, lymphocyte, myeloid monocyte Histiocytic lymphoma + Bone epithelial Osteosarcoma * TP53/ IMR90 - Enducing cell cycle arrest, apoptosis, senescence and DNA repair in response to cellular stress[21];Lung fibroblast - normal * PKCTHETA/BREAST - - Activation of mature T cells [15];

rl6070 <- rl_withGenes[which(rl_withGenes$pct_cons >60 & rl_withGenes$pct_cons <70 )]
rl6070_symbols <- unique(rl6070$symbol)
enriched6070 <- enrichr(rl6070_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Cellular_Component_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
e6070<-enriched6070[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e6070[e6070$P.value>=0.01 & e6070$P.value<=0.05,],
          options = list(
          dom = 't',
          scrollX = TRUE,
          fixedColumns = list(leftColumns = 1),
          scrollCollapse = TRUE))

50-60% Conservation

https://maayanlab.cloud/Enrichr/enrich?dataset=241b3fb3a27245fc21eea20ca0626605 Unorderdered rankings * ELF3/PDAC-Cell line - Associated with Sarcoma, Synovial and Ampulla of Vater Cancer [22]; Pancreatic Ductal adenocarcinoma * KLF4/PDAC-Cell line - Mediates p53(tumor supressor) gene, thereby controlling G1-to-S cell cycle transition after Dna damage[23];Pancreatic Ductal adenocarcinoma * EGR1/THP-1 - products of targets are required for differentiation and mitogenisis[19]; peripheral blood monocyte * NR1H3/ AtherOSCLEROTIC-FOAM - Family members are regulators of macrophage function (controlling transcriptional programs involved in lipid homeostasis and inflammation) [24]; “Lipid filled” macrophages

rl5060 <- rl_withGenes[which(rl_withGenes$pct_cons >50 & rl_withGenes$pct_cons <60 )]
rl5060_symbols <- unique(rl5060$symbol)
enriched5060 <- enrichr(rl5060_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Cellular_Component_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
e5060<-enriched5060[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e5060[e5060$P.value>=0.01 & e5060$P.value<=0.05,],
          options = list(
          dom = 't',
          scrollX = TRUE,
          fixedColumns = list(leftColumns = 1),
          scrollCollapse = TRUE))

40-50% Conservation

https://maayanlab.cloud/Enrichr/enrich?dataset=f3ef676d09ac7fd0ee717c2cd6c7f577 * Percent score 40-50%; very low combined score. Therefore lower Z score; pathways are not significantly enriched compared to the mean rank at this conservation level(?). * General trend before this group was a decrease in number of terms which are enriched. * E2F1/MCF-7 - plays a role in cell cycle progression and controls tumor suppresor protiens [25];Breast epithelial Adenocarcinoma * EOMES/HESCs - Development of the mesoderm and and central nervous system in vertebrates [26]. Recently, the tf also regulates the differentiations of certain NKT cells in the thymus[27]; * ELK1/HELA - Implicated to target several genes which have roles in certain neurological phenotypes[29];cancerous-cervical epithelial * MYC/BL - Proto-oncogene, dimerizes and complexes with TF MAX to regulate transcription of target genes[11]; Burkitt lymphoma * SA1/ ERYTHROID - Encodes a compoment of cohesin(under the alias STAG1) [28];

rl4050 <- rl_withGenes[which(rl_withGenes$pct_cons >40 & rl_withGenes$pct_cons <50 )]
rl4050_symbols <- unique(rl4050$symbol)
enriched4050 <- enrichr(rl4050_symbols, c("ChEA_2016"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
## Parsing results... Done.
e4050<-enriched4050[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e4050[e7080$P.value>=0.01 & e7080$P.value<=0.05,],
          options = list(
          dom = 't',
          scrollX = TRUE,
          fixedColumns = list(leftColumns = 1),
          scrollCollapse = TRUE))

30-40% Conservation

https://maayanlab.cloud/Enrichr/enrich?dataset=3ea142080e4a4d7c3de3d0d4f5d9e3b9 - E2F7/HELA - Cell cycle progression[31]; cancerous-cervical epithelial - FOXM1/OE33 and U2OS - Implicated in DNA drepair, senscence, and tumor proliferation[30]; Human esophagus and Epithelial bone Osteosarcoma (respectively)

rl3040 <- rl_withGenes[which(rl_withGenes$pct_cons >30 & rl_withGenes$pct_cons <40 )]
rl3040_symbols <- unique(rl3040$symbol)
enriched3040 <- enrichr(rl3040_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Cellular_Component_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
e3040<-enriched3040[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e3040[e3040$P.value>=0.01 & e3040$P.value<=0.05,],
          options = list(
          dom = 't',
          scrollX = TRUE,
          fixedColumns = list(leftColumns = 1),
          scrollCollapse = TRUE))

20-30% Conservation

https://maayanlab.cloud/Enrichr/enrich?dataset=0ccdc80016e7b3d65a0f3b3af2cfbea7

rl2030 <- rl_withGenes[which(rl_withGenes$pct_cons >20 & rl_withGenes$pct_cons <30 )]
rl2030_symbols <- unique(rl2030$symbol)
enriched2030 <- enrichr(rl2030_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Cellular_Component_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
e2030<-enriched2030[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e2030[e2030$P.value>=0.01 & e2030$P.value<=0.05,])
rl1020 <- rl_withGenes[which(rl_withGenes$pct_cons >10 & rl_withGenes$pct_cons <20 )]
rl1020_symbols <- unique(rl1020$symbol)
enriched1020 <- enrichr(rl1020_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## Uploading data to Enrichr... Done.
##   Querying ChEA_2016... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Cellular_Component_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
e1020<-enriched1020[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e1020[e1020$P.value>=0.01 & e1020$P.value<=0.05,])
rl10 <- rl_withGenes[which(rl_withGenes$pct_cons <10 )]
rl10_symbols <- unique(rl10$symbol)
enriched10 <- enrichr(rl10_symbols, c("ChEA_2016","GO_Molecular_Function_2018", "GO_Cellular_Component_2018", "GO_Biological_Process_2018"))
## No genes have been given
e10<-enriched10[["ChEA_2016"]]
#datatable(e[e$P.value<=0.05,])
datatable(e10[e10$P.value>=0.01 & e10$P.value<=0.05,])
dataLst <- list(
  "10-13% Conservation" = c(rlAnnoRankedList[[1]]$symbol),
  "13-17% Conservation" = c(rlAnnoRankedList[[2]]$symbol),
  "17-22% Conservation" = c(rlAnnoRankedList[[3]]$symbol),
  "22-27% Conservation" = c(rlAnnoRankedList[[4]]$symbol),
  "27-33% Conservation" = c(rlAnnoRankedList[[5]]$symbol),
  "33-39% Conservation" = c(rlAnnoRankedList[[6]]$symbol),
  "39-46% Conservation" = c(rlAnnoRankedList[[7]]$symbol),
  "46-54% Conservation" = c(rlAnnoRankedList[[8]]$symbol),
  "54-63% Conservation" = c(rlAnnoRankedList[[9]]$symbol),
  "63-87% Conservation " = c(rlAnnoRankedList[[10]]$symbol)

)

enrichLinks <- lapply(names(dataLst), function(group) {
  genes<- dataLst[[group]]
  response <- httr::POST(url = 'https://maayanlab.cloud/Enrichr/addList', body = list(
    'list' = paste0(genes, collapse = "\n"),
    'description' = group
  ))
  jsonlite::fromJSON(httr::content(response, as = "text"))
})
names(enrichLinks) <- names(dataLst)
for(i in enrichLinks){print(paste0("https://maayanlab.cloud/Enrichr/enrich?dataset=",i$shortId))}
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=56510a5e777ea3d21967b35a981f9233"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=acc574354a42f4cf330c483f8bfad96b"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=34b8911a71c6eb7836d97d98859f96c3"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=1b5b0d1a51dcde93aecc53aaf04a1f55"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=b6c5a21c1039439ca017f44091a8ff87"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=851d0d40342c4cb5b93874d8f7d9e396"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=bd7003f716b0b9de5546d3932d753e41"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=2f7a1e9fc338c614192a47c4bfc5b574"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=5841569f1ecd25ee48fc34201101bd01"
## [1] "https://maayanlab.cloud/Enrichr/enrich?dataset=4b2a4fd38b57d656189343412789be26"

General observations:

Many of the higher ranking pathways within each conservation group play a role in cell differentation or cell cycle progression. As well, many of these pathways are implicated in the regulation of tumor supressing or proliferating pathways. A trend within the overall data set is that there is a decrease in the number of enriched pathways as percent score decreases - with the exception of the 40-50% group and the groups with less than 30% conservation. For the former, there are no enriched pathways. Perhaps there is a correlation between the enrichment of transcription factors and their closeness to a TSS (see Metaplot analysis for <30% conseration groups).

Discussion/Conclusion

Using the percent score as a metric for what is a highly conserved R loop prooves to be an effective tool which aligns with what we currently know about Rloops. Windows with higher percent scores are more likely to share characteristics which are expected of Rloop regions. High percent scoring windows are more likely to be closer to the TSS of gene and still have a notable presence surrounding it. Conversely, lower percent scoring windows have little presence near the TSS. Upset plots show a similar distinction between lower percent scoring windows and high percent scoring windows; as the percent score increases there is a greater presence of peaks which overlap whole genes. This is expected as the region which Rloops span may overlap large portions of a gene on a template strand. Finally, we see that pathwys which are implicated in Rloop formation - such as the M phase and transcription and translational roles - are more present in windows with higher percent scores.

From here there are several avenues on which to continue this sort of work. Some next steps might be to analyze how Rloops are conservered among different species. Another might be to analyze the correlation between conserved genes and Rloops which may form from them - this idea comes the the presence of the Rho GTPase pathway in many percent score groups. More over, one could explore the presence of neuronal pathways in less conserved Rloops, as Rloops are implicated in many neurological diseases [1]. Finally, one could look at the difference in CG content between lesser and highly conserved Rloops. Rloops are found to form over unmethylated CpG islands [4] (which are characteristic of euchromatic regions) and so perhaps this may be another characteristic of highly conserved Rloops.

It is important to note that any trend extracted from analyzing the 70-100% percent scoring groups may have a bias due to their sample size (there are less than 500 windows with a percent score greater than 70).

  1. https://www.sciencedirect.com/science/article/pii/S1097276519300449#bib30
  2. https://www.mechanobio.info/what-is-mechanosignaling/what-are-small-gtpases/what-are-rho-gtpases/
  3. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3197203/
  4. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4955522/
  5. https://www.researchgate.net/post/Enrichr_what_value_of_combined_score_is_significant
  6. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-128
  7. https://www.atcc.org/products/all/CCL-247.aspx
  8. https://www.genecards.org/cgi-bin/carddisp.pl?gene=IKZF1#summaries
  9. https://www.genecards.org/cgi-bin/carddisp.pl?gene=PRKCG#summaries
  10. https://www.genecards.org/cgi-bin/carddisp.pl?gene=KDM5A#summaries
  11. https://www.genecards.org/cgi-bin/carddisp.pl?gene=MYC#:~:text=MYC%20(MYC%20Proto%2DOncogene%2C,Bcl2%20And%2FOr%20Bcl6%20Rearrangement.

https://jbiol.biomedcentral.com/articles/10.1186/jbiol217 THO complex 12. https://www.genecards.org/cgi-bin/carddisp.pl?gene=DACH1 13. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2792482/ 14. https://www.genecards.org/cgi-bin/carddisp.pl?gene=ZNF652 15. https://academic.oup.com/jb/article-abstract/132/6/841/856026?redirectedFrom=fulltext 16. https://www.genecards.org/cgi-bin/carddisp.pl?gene=DMRT1 17. https://www.nature.com/articles/onc201542 18. https://www.genecards.org/cgi-bin/carddisp.pl?gene=CREB1 19. https://www.genecards.org/cgi-bin/carddisp.pl?gene=EGR1 20. https://www.genecards.org/cgi-bin/carddisp.pl?gene=KDM6A 21. https://www.genecards.org/cgi-bin/carddisp.pl?gene=TP53 22. https://www.genecards.org/cgi-bin/carddisp.pl?gene=ELF3 23. https://www.genecards.org/cgi-bin/carddisp.pl?gene=KLF4 24. https://www.genecards.org/cgi-bin/carddisp.pl?gene=NR1H3 25. https://www.genecards.org/cgi-bin/carddisp.pl?gene=E2F1 26. https://www.genecards.org/cgi-bin/carddisp.pl?gene=EOMES 27. https://www.nature.com/articles/s42003-019-0389-3 28. https://www.genecards.org/cgi-bin/carddisp.pl?gene=STAG1 29. https://www.sciencedirect.com/topics/neuroscience/transcription-factor-elk-1 30. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3588610/#:~:text=FoxM1%20functions%20in%20response%20to,once%20the%20damage%20is%20repaired. 31. https://www.genecards.org/cgi-bin/carddisp.pl?gene=E2F7